ABSTRACT

A criterion for pruning parameters from N-gram backoff language 
models is developed, based on the relative entropy between the orig- 
inal and the pruned model. It is shown that the relative entropy re- 
sulting from pruning a single N-gram can be computed exactly and 
efficiently for backoff models. The relative entropy measure can be 
expressed as a relative change in training set perplexity. This leads 
to a simple pruning criterion whereby all N-grams that change per- 
plexity by less than a threshold are removed from the model. Exper- 
iments show that a production-quality Hub4 LM can be reduced to 
26% of its original size without increasing recognition error. We also
compare the approach to a heuristic pruning criterion by Seymore 
and Rosenfeld [Bp, and show that their approach can be interpreted 
as an approximation to the relative entropy criterion. Experimen- 
tally, both approaches select similar sets of N-grams (about 85% 
overlap), with the exact relative entropy criterion giving marginally 
better performance. 

1. Introduction 

N-gram backoff models [g], despite their shortcomings, 
still dominate as the technology of choice for state-of-the- 
art speech recognizers [|J. Two sources of performance 
improvements are the use of higher-order models (several 
DARPA-Hub4 sites now use 4-gram or 5-gram models) and 
the inclusion of more training data from more sources (Hub4 
models typically include Broadcast News, NABN and WSJ 
data). Both of these approaches lead to model sizes that are 
impractical unless some sort of parameter selection technique 
is used. In the case of N-gram models, the goal of parameter
selection is to choose which N-grams should have explicit
conditional probability estimates assigned by the model, so 
as to maximize performance (i.e., minimize perplexity and/or 
recognition error) while minimizing model size. As pointed 
out in [Q, pruning (selecting parameters from) a full N-gram 
model of higher order amounts to building a variable-length 
N-gram model, i.e., one in which training set contexts are not 
uniformly represented by N-grams of the same length. 

Seymore and Rosenfeld [^| showed that selecting N-grams 
based on their conditional probability estimates and fre- 
quency of use is more effective than the traditional absolute 
frequency thresholding. In this paper we revisit the problem 
of N-gram parameter selection by deriving a criterion that sat- 
isfies the following desiderata. 



• Soundness: The criterion should optimize some well- 
understood information-theoretic measure of language 
model quality. 

• Efficiency: An N-gram selection algorithm should be 
fast, i.e., take time proportional to the number of N- 
grams under consideration. 

• Self-containedness: As a practical consideration, we 
want to be able to prune N-grams from existing language 
models. This means a pruning criterion should be based 
only on information contained in the model itself. 

In the remainder of this paper we describe our pruning algorithm
based on relative entropy distance between N-gram distributions
(Section 2), investigate how the quantities required for the
pruning criterion can be obtained efficiently and exactly
(Section 3), show that the criterion is highly effective
in reducing the size of state-of-the-art language models with
negligible performance penalties (Section 4), investigate the
relation between our pruning criterion and that of Seymore
and Rosenfeld (Section 5), and draw some conclusions
(Section 6).

2. N-gram Pruning Based on Relative 
Entropy 

An N-gram language model represents a probability distribution
over words $w$, conditioned on $(N-1)$-tuples of preceding
words, or histories $h$. Only a finite set of N-grams
$(w, h)$ have conditional probabilities explicitly represented in
the model. The remaining N-grams are assigned a probability
by the recursive backoff rule

$$p(w \mid h) = \alpha(h)\, p(w \mid h')$$

where $h'$ is the history $h$ truncated by the first word (the one
most distant from $w$), and $\alpha(h)$ is a backoff weight associated
with history $h$, determined so that $\sum_w p(w \mid h) = 1$.
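
The backoff rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries `explicit` and `alpha`, the tuple encoding of histories, and all probabilities are invented for the example, and a history without a stored backoff weight is assumed to have weight 1.

```python
def backoff_prob(word, history, explicit, alpha):
    """Return p(word | history) under the recursive backoff rule.

    explicit maps (history, word) to an explicit estimate and alpha maps a
    history to its backoff weight; both containers are hypothetical.
    """
    if (history, word) in explicit:
        return explicit[(history, word)]
    if not history:
        raise KeyError(f"no unigram estimate for {word!r}")
    # Back off: drop the word most distant from `word` and scale by alpha(h).
    # Assumption: histories without a stored weight use alpha = 1.
    return alpha.get(history, 1.0) * backoff_prob(word, history[1:], explicit, alpha)


# Toy bigram model over a three-word vocabulary (all numbers invented).
explicit = {
    ((), "a"): 0.5, ((), "b"): 0.3, ((), "c"): 0.2,
    (("a",), "a"): 0.6,
}
# alpha(("a",)) = (1 - 0.6) / (1 - 0.5) = 0.8 makes p(. | a) sum to one.
alpha = {("a",): 0.8}

probs = {w: backoff_prob(w, ("a",), explicit, alpha) for w in "abc"}
```

In this toy model only "a" has an explicit bigram estimate after history "a"; the other two words receive their unigram probability scaled by the backoff weight.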

The goal of N-gram pruning is to remove explicit estimates
$p(w \mid h)$ from the model, thereby reducing the number of
parameters, while minimizing the performance loss. Note that
after pruning, the retained explicit N-gram probabilities are
unchanged, but backoff weights will have to be recomputed,
thereby changing the values of implicit (backed-off) probability
estimates. Thus, the pruning approach chosen is conceptually
independent of the estimator chosen to determine
the explicit N-gram estimates.

Since one of our goals is to prune N-gram models without 
access to any statistics not contained in the model itself, a 
natural criterion is to minimize the 'distance' between the 
distribution embodied by the original model and that of the 
pruned model. A standard measure of divergence between 
distributions is relative entropy or Kullback-Leibler distance 
(see, e.g., [Q]). Although not strictly a distance metric, it is a 
non-negative, continuous function that is zero if and only if 
the two distributions are identical. 

Let $p(\cdot \mid \cdot)$ denote the conditional probabilities assigned by
the original model, and $p'(\cdot \mid \cdot)$ the probabilities in the pruned
model. Then, the relative entropy between the two models is

$$D(p \,\|\, p') = -\sum_{w_i, h_j} p(w_i, h_j)\,[\log p'(w_i \mid h_j) - \log p(w_i \mid h_j)] \qquad (1)$$

where the summation is over all words $w_i$ and histories (contexts)
$h_j$.

Our goal will be to select N-grams for pruning such that
$D(p \,\|\, p')$ is minimized. However, it would not be feasible
to minimize over all possible subsets of N-grams. Instead,
we will assume that the N-grams affect the relative entropy
roughly independently, and compute $D(p \,\|\, p')$ due to each
individual N-gram. We can then rank the N-grams by their effect
on the model entropy, and prune those that increase relative
entropy the least.

To choose pruning thresholds, it is helpful to look at a more
intuitive interpretation of $D(p \,\|\, p')$ in terms of perplexity, the
average branching factor of the language model. The perplexity
of the original model (evaluated on the distribution it
embodies) is given by

$$PP = e^{-\sum_{h,w} p(h,w) \log p(w \mid h)}$$

whereas the perplexity of the pruned model on the original
distribution is

$$PP' = e^{-\sum_{h,w} p(h,w) \log p'(w \mid h)}$$

The relative change in model perplexity can now be expressed
in terms of relative entropy:

$$\frac{PP' - PP}{PP} = e^{D(p \,\|\, p')} - 1$$
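
The identity between the relative perplexity change and the relative entropy can be checked numerically. The following is a minimal sketch on an invented two-history, two-word distribution; the names `p_h`, `p` and `q` (standing in for $p'$) are assumptions made for the example.

```python
import math

# Toy distribution: marginal p(h), original conditionals p(w|h) and
# "pruned" conditionals q(w|h). All numbers are invented for illustration.
p_h = {"x": 0.6, "y": 0.4}
p = {"x": {"a": 0.7, "b": 0.3}, "y": {"a": 0.2, "b": 0.8}}
q = {"x": {"a": 0.6, "b": 0.4}, "y": {"a": 0.2, "b": 0.8}}


def cross_entropy(model):
    """-sum over (h, w) of p(h, w) log model(w | h), with p(h, w) = p(h) p(w | h)."""
    return -sum(p_h[h] * p[h][w] * math.log(model[h][w]) for h in p for w in p[h])


pp = math.exp(cross_entropy(p))          # perplexity of the original model
pp_pruned = math.exp(cross_entropy(q))   # perplexity of the pruned model

# Relative entropy D(p || q) as in equation (1).
kl = -sum(
    p_h[h] * p[h][w] * (math.log(q[h][w]) - math.log(p[h][w]))
    for h in p for w in p[h]
)

relative_change = (pp_pruned - pp) / pp  # agrees with exp(kl) - 1
```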

This suggests a simple thresholding algorithm for N-gram
pruning:

1. Select a threshold $\theta$.

2. Compute the relative perplexity increase due to pruning
each N-gram individually.

3. Remove all N-grams that raise the perplexity by less than
$\theta$, and recompute backoff weights.
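
The selection step of this procedure can be sketched as follows. The function name `select_for_pruning`, the dictionary of per-N-gram relative entropies and its values are hypothetical; the sketch assumes those relative entropies have already been computed and leaves out the recomputation of backoff weights.

```python
import math


def select_for_pruning(kl_by_ngram, theta):
    """Return the N-grams whose individual relative perplexity increase,
    exp(D) - 1, is below the threshold theta."""
    # math.expm1(d) computes exp(d) - 1 accurately for very small d.
    return [ngram for ngram, d in kl_by_ngram.items() if math.expm1(d) < theta]


# Invented relative entropies D(p || p') for three N-grams.
kl_by_ngram = {("the", "of"): 3e-9, ("a", "cat"): 2e-7, ("x", "y"): 5e-10}
pruned = select_for_pruning(kl_by_ngram, theta=1e-8)
```

With the threshold set to 1e-8, the two N-grams with the smallest relative entropy are selected for removal and the third is retained.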

Relation to Other Work. Our choice of relative entropy
as an optimization criterion is by no means new. Relative 
entropy minimization (sometimes in the guise of likelihood 
maximization) is the basis of many model optimization tech- 
niques proposed in the past, e.g., for text compression |[l|], 
Markov model induction 0], Kneser [||] first suggested 
applying it to backoff N-gram models, although, as shown in
Section 5, the heuristic pruning algorithm of Seymore and
Rosenfeld |J amounts to an approximate relative entropy 
minimization. The algorithm described in the next section 
is novel in that it removes some of the approximations em- 
ployed in previous approaches. Specifically, the algorithm of 
[^] assumes that backoff weights are unchanged by the prun- 
ing, and does not consider the effect that a changed back- 
off weight has on N-gram probabilities other than the pruned 
one (this effect is discussed in more detail in Section ||). 

The main approximation that remains in our algorithm is the 
greedy aspect: we do not consider possible interactions be- 
tween selected N-grams, and prune based solely on relative 
entropy due to removing a single N-gram, so as to avoid 
searching the exponential space of N-gram subsets. 

3. Computing Relative Entropy 

We now show how the relative entropy $D(p \,\|\, p')$ due to pruning
a single N-gram parameter can be computed exactly and
efficiently. Consider the effect of removing an N-gram consisting
of history $h$ and word $w$. This entails two changes to
the probability estimates.

• The backoff weight $\alpha(h)$ associated with history $h$ is
changed, affecting all backed-off estimates involving
history $h$. We use the notation $BO(w_i, h)$ to denote this
case, i.e., that the original model does not contain an
explicit N-gram estimate for $(w_i, h)$. Let $\alpha(h)$ be the
original backoff weight, and $\alpha'(h)$ the backoff weight in
the pruned model.

• The explicit estimate $p(w \mid h)$ is replaced by a backoff
estimate

$$p'(w \mid h) = \alpha'(h)\, p(w \mid h')$$

where $h'$ is the history obtained by dropping the first
word in $h$.

All estimates not involving history $h$ remain unchanged, as
do all estimates for which $BO(w_i, h)$ is not true.

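Substituting in (1), we get

$$
\begin{aligned}
D(p \,\|\, p') &= -\sum_{w_i} p(w_i, h)\,[\log p'(w_i \mid h) - \log p(w_i \mid h)] \\
&= -p(w, h)\,[\log p'(w \mid h) - \log p(w \mid h)] \\
&\quad - \sum_{w_i : BO(w_i, h)} p(w_i, h)\,[\log p'(w_i \mid h) - \log p(w_i \mid h)] \\
&= -p(h)\,\Big\{ p(w \mid h)\,[\log p'(w \mid h) - \log p(w \mid h)] \\
&\quad + \sum_{w_i : BO(w_i, h)} p(w_i \mid h)\,[\log p'(w_i \mid h) - \log p(w_i \mid h)] \Big\} \qquad (2)
\end{aligned}
$$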

At first it seems as if computing $D(p \,\|\, p')$ for a given N-gram
requires a summation over the vocabulary, something
that would be infeasible for large vocabularies and/or models.
However, by plugging in the terms for the backed-off
estimates, we see that the sum can be factored so as to allow
a more efficient computation.

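$$
\begin{aligned}
D(p \,\|\, p') &= -p(h)\,\Big\{ p(w \mid h)\,[\log p(w \mid h') + \log \alpha'(h) - \log p(w \mid h)] \\
&\quad + \sum_{w_i : BO(w_i, h)} p(w_i \mid h)\,[\log \alpha'(h) - \log \alpha(h)] \Big\} \\
&= -p(h)\,\Big\{ p(w \mid h)\,[\log p(w \mid h') + \log \alpha'(h) - \log p(w \mid h)] \\
&\quad + [\log \alpha'(h) - \log \alpha(h)] \sum_{w_i : BO(w_i, h)} p(w_i \mid h) \Big\}
\end{aligned}
$$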
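
As a sanity check on the factored form, the sketch below compares it with a direct summation over the whole vocabulary for a toy bigram model. The four-word vocabulary, the probabilities and all function names are invented for illustration; backoff weights are recomputed from scratch here for clarity rather than speed.

```python
import math

# Toy model (invented numbers): unigram p(w | h') and explicit bigram
# estimates p(w | h) for the single history h = "a".
unigram = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
bigram_a = {"a": 0.3, "b": 0.5}
p_h = unigram["a"]  # marginal history probability p(h)


def backoff_weight(explicit):
    """alpha(h) for a given set of explicit estimates under history h."""
    num = 1.0 - sum(explicit.values())
    den = 1.0 - sum(unigram[w] for w in explicit)
    return num / den


def cond_prob(w, explicit):
    """p(w | h): explicit estimate if present, backed-off estimate otherwise."""
    if w in explicit:
        return explicit[w]
    return backoff_weight(explicit) * unigram[w]


def kl_closed_form(w):
    """Relative entropy from pruning (w, h), using the factored expression."""
    alpha = backoff_weight(bigram_a)
    pruned = {v: p for v, p in bigram_a.items() if v != w}
    alpha_new = backoff_weight(pruned)
    backoff_mass = 1.0 - sum(bigram_a.values())  # mass given to backoff
    p_wh = bigram_a[w]
    return -p_h * (
        p_wh * (math.log(unigram[w]) + math.log(alpha_new) - math.log(p_wh))
        + (math.log(alpha_new) - math.log(alpha)) * backoff_mass
    )


def kl_brute_force(w):
    """Same quantity, summing over every word in the vocabulary."""
    pruned = {v: p for v, p in bigram_a.items() if v != w}
    return -p_h * sum(
        cond_prob(v, bigram_a)
        * (math.log(cond_prob(v, pruned)) - math.log(cond_prob(v, bigram_a)))
        for v in unigram
    )
```

Calling `kl_closed_form("b")` and `kl_brute_force("b")` should give the same value up to floating-point rounding, since the words with explicit estimates other than the pruned one contribute nothing to the sum.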

The sum in the last line represents the total probability mass
given to backoff (the numerator for computing $\alpha(h)$); it needs
to be computed only once for each $h$, which is done efficiently
by summing over all non-backoff estimates:

$$\sum_{w_i : BO(w_i, h)} p(w_i \mid h) = 1 - \sum_{w_i : \neg BO(w_i, h)} p(w_i \mid h)$$

The marginal history probabilities $p(h)$ are obtained by multiplying
conditional probabilities $p(h_1)\, p(h_2 \mid h_1) \cdots$.
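
The chain-rule product for the marginal history probability might look like this in code. The helper `history_prob` and the lookup table are hypothetical; in a real model the conditional lookup would be the model's own backoff estimate.

```python
def history_prob(history, cond_prob):
    """p(h) for h = (h1, ..., hk), computed as p(h1) p(h2 | h1) ...

    cond_prob(word, context) is a hypothetical lookup returning p(word | context).
    """
    prob = 1.0
    for i, word in enumerate(history):
        prob *= cond_prob(word, history[:i])
    return prob


# Toy conditional table with invented numbers.
table = {((), "the"): 0.05, (("the",), "cat"): 0.01}
p_h = history_prob(("the", "cat"), lambda w, h: table[(h, w)])
```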

Finally, we need to be able to compute the revised backoff
weights $\alpha'(h)$ efficiently, i.e., in constant time per N-gram.
Recall that

$$\alpha(h) = \frac{1 - \sum_{w_i : \neg BO(w_i, h)} p(w_i \mid h)}{1 - \sum_{w_i : \neg BO(w_i, h)} p(w_i \mid h')}$$

$\alpha'(h)$ is obtained by dropping the term for the pruned N-gram
$(w, h)$ from the summation in both numerator and denominator.
Thus, we compute the original numerator and denominator
once per history $h$, and then add $p(w \mid h)$ and $p(w \mid h')$,
respectively, to obtain $\alpha'(h)$ for each pruned $w$.
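
A minimal sketch of this constant-time update, assuming the numerator and denominator for history $h$ have been cached; the function name and the numbers are invented for the example.

```python
def revised_backoff_weight(numerator, denominator, p_w_h, p_w_hprime):
    """alpha'(h) after pruning (w, h): removing a term from each subtracted
    sum is the same as adding it back to the cached numerator/denominator."""
    return (numerator + p_w_h) / (denominator + p_w_hprime)


# Cached once per history h (invented numbers):
#   numerator   = 1 - sum of explicit p(w_i | h)
#   denominator = 1 - sum of p(w_i | h') over the same words
numerator, denominator = 0.2, 0.3
alpha = numerator / denominator

# Pruning an N-gram with p(w | h) = 0.5 and p(w | h') = 0.3.
alpha_new = revised_backoff_weight(numerator, denominator, 0.5, 0.3)
```

Each candidate $w$ under the same history reuses the cached pair, so no pass over the other explicit estimates is needed.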

4. Experiments 

We evaluated relative entropy-based language model pruning 
in the Broadcast News domain, using SRI's 1996 Hub4 eval- 
uation system [H]. N-best lists generated with a bigram lan- 
guage model were rescored with various pruned versions of a 
large four-gram language model.¹

¹ We used the 1996 system, partly due to time constraints, partly
because the 1997 system generated N-best lists using a large trigram
language model, which makes rescoring experiments with smaller
language models less meaningful.

Table 1: Perplexity (PP) and word error rate (WER) as a
function of pruning threshold and language model sizes.

As noted in Section 2, the pruning algorithm is applicable irrespective
of the particular N-gram estimator used. We used
Good-Turing smoothing [|]| throughout and did not investi- 
gate possible interactions between smoothing methods and 
pruning. 

Table 1 shows model size, perplexity and word error results as
determined on the development test set, for various pruning 
thresholds. The first and last rows of the table give the per- 
formance of the full four-gram and the pure trigram model, 
respectively. Note that perplexity here refers to the indepen- 
dent test set, not to the training set perplexity that underlies 
the pruning criterion. 

As shown, pruning is highly effective. For $\theta = 10^{-8}$, we obtain
a model that is 26% the size of the original model without
degradation in recognition performance and less than 6%
perplexity increase. Comparing the pruned four-gram model
to the full trigram model, we see that it is better to include
non-redundant four-grams than to use a much larger number
of trigrams. The pruned ($\theta = 10^{-8}$) four-gram has the same
perplexity as the full trigram and lower word error ($p < 0.07$).

5. Comparison to Seymore and Rosenfeld's 
Approach 

In [[)]], Seymore and Rosenfeld proposed a different pruning 
scheme for backoff models (henceforth called the "SR criterion,"
as opposed to the relative entropy, or "RE criterion").
In the SR approach, N-grams are ranked by a weighted difference
of the log probability estimate before and after pruning,

$$N(w, h)\,[\log p(w \mid h) - \log p'(w \mid h)] \qquad (3)$$

where $N(w, h)$ is the discounted frequency with which N-gram
$(w, h)$ was observed in training. Comparing (3) with
the expansion of $D(p \,\|\, p')$ in (2), we see that the two criteria
are related. First, we can assume that $N(w, h)$ is roughly
proportional to $p(w, h)$, so for ranking purposes the two are
equivalent. The difference of the log probabilities in (3) corresponds
to the same quantity in (2). Thus, the major difference
between the two approaches is that the SR criterion
does not include the effect on N-grams other than the one being
considered, namely, those due to changes in the backoff
weight $\alpha(h)$.

To evaluate the effect of ignoring backed-off estimates in the
pruning criterion we compared the performance of the SR and
the RE criterion on the Broadcast News development test set,
using the same N-best rescoring system as described before.
To make the methods comparable we adopted Seymore and
Rosenfeld's approach of ranking the N-grams according to
the criterion in question and retaining a specified number of
N-grams from the top of the ranked list. For the sake of simplicity
we used a trigram-only version of the Hub4 language
model used earlier, and restricted pruning to trigrams.

We also verified that the discounted frequency $N(w, h)$ in
(3) could be replaced with the model's N-gram probability
$p(w, h)$ without changing the ranking significantly: over 99%
of the chosen N-grams were the same. This means the SR criterion
can also be based entirely on information in the model
itself, making it more convenient for model post-processing.

Tables 2 and 3 show model perplexity and word error rates,
respectively, for the two pruning methods as a function of the
number of trigrams in the model. In terms of perplexity, we
see a very small, albeit consistent, advantage for the relative
entropy method, as expected given the optimized criterion.
However, the difference is negligible when it comes to recognition
performance, where results are identical or differ only
non-significantly. We can thus conclude that, for practical
purposes, the SR criterion is a very good approximation to
the RE criterion.

Table 2: Comparison of Seymore and Rosenfeld (SR) and
Relative Entropy (RE) pruning: perplexities as a function of
the number of trigrams.

Table 3: Comparison of Seymore and Rosenfeld (SR) and
Relative Entropy (RE) pruning: word error rate as a function
of the number of trigrams.

Finally, we looked at the overlap of the N-grams chosen by
the two criteria, shown in Table 4. The percentage of common
trigrams ranges from 88.3% to 85.2%, and seems to decrease
as the model size increases. We can expect the most frequent
N-grams to be among those that are shared, making it no surprise
that both methods perform so similarly.

No. trigrams    No. shared trigrams
1000            883
10000           8721
100000          85599
1000000         852016

Table 4: Overlap of selected trigrams between SR and RE
methods.
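
The relation between the two criteria discussed in this section can be illustrated on a toy bigram model. This is a sketch under several assumptions: the vocabulary and probabilities are invented, $p(w, h)$ stands in for the discounted count $N(w, h)$, and the SR-style score holds the backoff weight fixed, as Section 2 describes for that approach.

```python
import math

# Toy model (invented numbers): unigram p(w | h') and explicit bigram
# estimates p(w | h) for the single history h = "a".
unigram = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
explicit = {"a": 0.3, "b": 0.5}
p_h = unigram["a"]

num = 1.0 - sum(explicit.values())               # backoff mass under h
den = 1.0 - sum(unigram[w] for w in explicit)
alpha = num / den                                # original backoff weight


def sr_score(w):
    """SR-style score: p(w, h) times the log probability difference,
    with the backoff weight assumed unchanged by pruning."""
    return p_h * explicit[w] * (math.log(explicit[w]) - math.log(alpha * unigram[w]))


def re_score(w):
    """RE score: also accounts for the revised backoff weight and its
    effect on all other backed-off estimates under h."""
    alpha_new = (num + explicit[w]) / (den + unigram[w])
    own = explicit[w] * (math.log(alpha_new * unigram[w]) - math.log(explicit[w]))
    others = num * (math.log(alpha_new) - math.log(alpha))
    return -p_h * (own + others)


sr_scores = {w: sr_score(w) for w in explicit}
re_scores = {w: re_score(w) for w in explicit}
```

In this small example both scores place the same N-gram first for pruning, although their absolute values differ because only the RE score includes the backoff-weight terms.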

6. Conclusions 

We developed an algorithm for N-gram selection for backoff 
N-gram language models, based on minimizing the relative 
entropy between the full and the pruned model. Experiments 
show that the algorithm is highly effective, eliminating all but 
26% of the parameters in a Hub4 four-gram model without 
significantly affecting performance. The pruning criterion of 
Seymore and Rosenfeld is seen to be an approximate version 
of the relative entropy criterion; empirically, the two methods 
perform about the same. 

Erratum

The published paper had an error in the second equation for
$D(p \,\|\, p')$ in Section 3. In two instances, the quantity $\log \alpha'(h)$
had been mistakenly typeset as $\log \alpha(h')$. Also, the information
in reference m\ was incorrect.



